Prosper Data Exploration

Judy Jensen, P4 Final project

Data exploration and analysis

Univariate Plots

subset of prosper loan data

## [1] 113937     83
## 'data.frame':    113937 obs. of  30 variables:
##  $ LoanStatus                 : Ord.factor w/ 12 levels "Current"<"FinalPaymentInProgress"<..: 3 1 3 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ ProsperScore               : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ EmploymentStatus           : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration   : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner        : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CreditScoreRangeUpper      : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine    : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines         : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines            : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts      : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment: num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months       : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries             : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies       : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent           : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years    : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years   : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months  : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance     : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization        : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit    : num  1500 10266 NA 30754 695 ...
##  $ DebtToIncomeRatio          : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                : Ord.factor w/ 8 levels "$0"<"$1-24,999"<..: 3 4 8 3 6 6 3 3 3 3 ...
##  $ StatedMonthlyIncome        : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount         : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate        : Date, format: "2007-09-12" "2014-03-03" ...
##  $ LStatus                    : Ord.factor w/ 3 levels "good standing"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year                       : num  2007 2014 2007 2012 2013 ...
##                  LoanStatus     BorrowerAPR       ProsperScore  
##  Current              :56576   Min.   :0.00653   Min.   : 1.00  
##  Completed            :38074   1st Qu.:0.15629   1st Qu.: 4.00  
##  Chargedoff           :11992   Median :0.20976   Median : 6.00  
##  Defaulted            : 5018   Mean   :0.21883   Mean   : 5.95  
##  Past Due (1-15 days) :  806   3rd Qu.:0.28381   3rd Qu.: 8.00  
##  Past Due (31-60 days):  363   Max.   :0.51229   Max.   :11.00  
##  (Other)              : 1108   NA's   :25        NA's   :29084  
##       EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner
##  Employed     :67322   Min.   :  0.00           False:56459        
##  Full-time    :26355   1st Qu.: 26.00           True :57478        
##  Self-employed: 6134   Median : 67.00                              
##  Not available: 5347   Mean   : 96.07                              
##  Other        : 3806   3rd Qu.:137.00                              
##               : 2255   Max.   :755.00                              
##  (Other)      : 2718   NA's   :7625                                
##  CreditScoreRangeUpper        FirstRecordedCreditLine CurrentCreditLines
##  Min.   : 19.0                            :   697     Min.   : 0.00     
##  1st Qu.:679.0         1993-12-01 00:00:00:   185     1st Qu.: 7.00     
##  Median :699.0         1994-11-01 00:00:00:   178     Median :10.00     
##  Mean   :704.6         1995-11-01 00:00:00:   168     Mean   :10.32     
##  3rd Qu.:739.0         1990-04-01 00:00:00:   161     3rd Qu.:13.00     
##  Max.   :899.0         1995-03-01 00:00:00:   159     Max.   :59.00     
##  NA's   :591           (Other)            :112389     NA's   :7604      
##  OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts
##  Min.   : 0.00   Min.   :  2.00             Min.   : 0.00        
##  1st Qu.: 6.00   1st Qu.: 17.00             1st Qu.: 4.00        
##  Median : 9.00   Median : 25.00             Median : 6.00        
##  Mean   : 9.26   Mean   : 26.75             Mean   : 6.97        
##  3rd Qu.:12.00   3rd Qu.: 35.00             3rd Qu.: 9.00        
##  Max.   :54.00   Max.   :136.00             Max.   :51.00        
##  NA's   :7604    NA's   :697                                     
##  OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries   
##  Min.   :    0.0             Min.   :  0.000      Min.   :  0.000  
##  1st Qu.:  114.0             1st Qu.:  0.000      1st Qu.:  2.000  
##  Median :  271.0             Median :  1.000      Median :  4.000  
##  Mean   :  398.3             Mean   :  1.435      Mean   :  5.584  
##  3rd Qu.:  525.0             3rd Qu.:  2.000      3rd Qu.:  7.000  
##  Max.   :14985.0             Max.   :105.000      Max.   :379.000  
##                              NA's   :697          NA's   :1159     
##  CurrentDelinquencies AmountDelinquent   DelinquenciesLast7Years
##  Min.   : 0.0000      Min.   :     0.0   Min.   : 0.000         
##  1st Qu.: 0.0000      1st Qu.:     0.0   1st Qu.: 0.000         
##  Median : 0.0000      Median :     0.0   Median : 0.000         
##  Mean   : 0.5921      Mean   :   984.5   Mean   : 4.155         
##  3rd Qu.: 0.0000      3rd Qu.:     0.0   3rd Qu.: 3.000         
##  Max.   :83.0000      Max.   :463881.0   Max.   :99.000         
##  NA's   :697          NA's   :7622       NA's   :990            
##  PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
##  Min.   : 0.0000          Min.   : 0.000            Min.   :      0       
##  1st Qu.: 0.0000          1st Qu.: 0.000            1st Qu.:   3121       
##  Median : 0.0000          Median : 0.000            Median :   8549       
##  Mean   : 0.3126          Mean   : 0.015            Mean   :  17599       
##  3rd Qu.: 0.0000          3rd Qu.: 0.000            3rd Qu.:  19521       
##  Max.   :38.0000          Max.   :20.000            Max.   :1435667       
##  NA's   :697              NA's   :7604              NA's   :7604          
##  BankcardUtilization AvailableBankcardCredit DebtToIncomeRatio
##  Min.   :0.000       Min.   :     0          Min.   : 0.000   
##  1st Qu.:0.310       1st Qu.:   880          1st Qu.: 0.140   
##  Median :0.600       Median :  4100          Median : 0.220   
##  Mean   :0.561       Mean   : 11210          Mean   : 0.276   
##  3rd Qu.:0.840       3rd Qu.: 13180          3rd Qu.: 0.320   
##  Max.   :5.950       Max.   :646285          Max.   :10.010   
##  NA's   :7604        NA's   :7544            NA's   :8554     
##          IncomeRange    StatedMonthlyIncome LoanOriginalAmount
##  $25,000-49,999:32192   Min.   :      0     Min.   : 1000     
##  $50,000-74,999:31050   1st Qu.:   3200     1st Qu.: 4000     
##  $100,000+     :17337   Median :   4667     Median : 6500     
##  $75,000-99,999:16916   Mean   :   5608     Mean   : 8337     
##  Not displayed : 7741   3rd Qu.:   6825     3rd Qu.:12000     
##  $1-24,999     : 7274   Max.   :1750003     Max.   :35000     
##  (Other)       : 1427                                         
##  LoanOriginationDate           LStatus           year     
##  Min.   :2005-11-15   good standing:94855   Min.   :2005  
##  1st Qu.:2008-10-02   late         : 2067   1st Qu.:2008  
##  Median :2012-06-26   default      :17010   Median :2012  
##  Mean   :2011-07-21   NA's         :    5   Mean   :2011  
##  3rd Qu.:2013-09-18                         3rd Qu.:2013  
##  Max.   :2014-03-12                         Max.   :2014  
## 
Data - Initial Observations
  • 114k observations of 81 variables! Looked for natural groupings of the variable to provide some focus for the exploration.
  • borrower details - income, debt to income ratio, FICO score, home ownership, credit status (revolving credit, etc, customer #, employment status, etc)
  • Prosper calculation of borrowers credit-worthiness
  • Loan parameters (amount paid to date, amount borrowed, duration, monthly payment, payment status)
  • Loan price (APR, fees)
  • Marketing / referrals

I will primarily focus on the borrower details, loan parameters, and prosper’s calculation of creditworthiness as they relate to the price of the loan, and also on the payment status of the loan (good, late, default).

Data Wrangling

Created two new variables - Year (year loan initiated) and LStatus (consolidated LoanStatus into 3 levels - good, late, default). In addition, I created ordered factors for IncomeRange and LoanStatus so these plots would be logical.

Overview
  1. Over 80% of Prosper Loans are complete or current (LStaus = good standing: (95k / 114k).

  2. Borrower Details
  1. Loan details (range from 1st Qu to 3rd Qu)
  • APR between 15.6 and 28.3%
  • Loan size between 4000 and 12000
  • Term is 36 mos on majority of loans

These distributions are for variables related to income.

  • Most borrowers have income between $25k and $75k.
  • Plotting the monthly income X 12 (Annual Income) reveals more detail about the distribution which is a normal distribution
  • The length of employment status is not normal, and is unexpected from the point of view that the first quartile of borrowers have been employed for just over 2 years (26 mos). This may be due to changes in employment status, but also possibly the attractiveness to newer workers of a new loan type.

These distributions are related to credit-worthiness for Prosper borrowers and were selected to reflect the variety of variables in the dataset. These variables will be tuned in the bivariate analysis.
+ Debt to income ratio is a median of 0.22
+ Credit Line, 7 years has a median of 25
+ Current credit lines is 10.
the above plots all appear to be normal distributions.
+ Revolving Monthly Payment distributions is not normal more skewed to lower payments
+ Bankcard utilization is not normal, and high with a median of 60%.

(closer look at Revolving Monthly Payments does not reveal additional interesting information)

Loan amounts have distinct bands at increments of $5k with 5k, 10k and 15k being the most popular loan amounts.
APR also appear to group at slightly below 10%, 20%, and 30% with a peak at 0.36 which is primarily from 2011 and 2012.

The range of FICO scores (CreditScoreRangeUpper) has a small peak at ~ 550, and then a peak near the median of 699. The distribution on the ProsperScore is unusual with several spikes along the peak of the distribution.

After an SEC initiated in hiatus in 2009, the number of loans is growing at a strong rate. Source: https://en.wikipedia.org/wiki/Prosper_Marketplace

Summary of Univariate Analysis

  1. Borrower Income Details
  • Median annual income is 56k
  • Median employment duration is 5y 7 mos
  1. Borrower Credit Details
  • Debt to income ratio is a median of 0.22
  • Credit Line, 7 years has a median of 25
  • Current credit lines median is 10.
  • Revolving Monthly Payment median is $271
  • Bankcard utilization is high with a median of 60%.
  1. Loan Details
  • Median APR is 21.0%
  • Median loan size is 6500, middle range is 4k-12k
    Bivariate analysis will be utilized to determine which of these factors correlates most strongly with loan details and loan status. At this point, there is no obvious additional data shaping required beyond the items already discussed - creating LStatus from LoanStatus and updating certain factors to ordered factors.

Bivariate Plots

The following variables have the highest correlation to the APR(price) of the loan -
positive correlation (increase APR): Delinquencies Last 7 Years, Inquiries last 6 Months
negative correlation (decrease APR): Prosper Score, Credit Score Range Upper, Available Bankcard Credit, Loan Original Amount

Review key variables against BorrowerAPR

Annual Income does have an effect on APR, borrower APR declines ~ 5% between 25k and 125k and then is no longer affected by income.
There does not appear to be any correlation between APR and the length of employment.

This set of charts that shows a linear relationship between ProsperScore and CreditScoreUpper above 6 / 700 respectively but not at lower credit ratings.
The plots of Prosper score and Credit Score indicate why this is the case - APR remains flat with CreditScoreUpper until about 579, and doesn’t decrease noticeably until 680-700. The ProsperScore plot is more linear.

## 
##  Pearson's product-moment correlation
## 
## data:  prosper$ProsperScore and prosper$CreditScoreRangeUpper
## t = 115.87, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3637793 0.3753979
## sample estimates:
##      cor 
## 0.369603
## 99% 
##  11

These charts provide additional information from the correlation values to APR from the ggpairs calculation.
Loan Original Amount shows two peaks in APR at loans of 3k and at 7.5k
The curves flatten eventually indicating there is an upper and lower limit to the borrower APR, (although the loans are still issued at high risk values.)

The distribution of loans at various APRs is noticably different between borrowers w/ AvailableBankcardCredit = 0 and != 0. The rates are higher for borrowers w/ no Bankcard credit. Also, the peak at 0.36 is more noticeable for borrowers with no Bankcard credit, although the peak exists on both plots.
Note difference in y-scale

Review credit rating variables and APR v. Loan Status

These charts indicate that the price of the loan does take into consideration the associated risk based on the borrowers credit scores. The prosperscore shows a larger relative difference between Loan Status result than the CreditScoreRangeUpper.

## prosper.na$LStatus: good standing
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   4.000   6.000   6.017   8.000  11.000   18410 
## -------------------------------------------------------- 
## prosper.na$LStatus: late
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   5.064   7.000  11.000 
## -------------------------------------------------------- 
## prosper.na$LStatus: default
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   4.000   5.000   5.427   7.000  11.000   10669
## prosper.na$LStatus: good standing
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   719.0   712.5   739.0   899.0     416 
## -------------------------------------------------------- 
## prosper.na$LStatus: late
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   619.0   679.0   699.0   707.1   739.0   879.0 
## -------------------------------------------------------- 
## prosper.na$LStatus: default
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   619.0   659.0   659.8   719.0   879.0     174
## prosper.na$LStatus: good standing
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15020 0.20270 0.21180 0.27250 0.51230      25 
## -------------------------------------------------------- 
## prosper.na$LStatus: late
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06327 0.21830 0.27280 0.26740 0.32580 0.38490 
## -------------------------------------------------------- 
## prosper.na$LStatus: default
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00864 0.18690 0.25440 0.25220 0.30910 0.50630

Build Linear Model

## 
## Calls:
## m1: lm(formula = BorrowerAPR ~ ProsperScore, data = prosper.na)
## m2: lm(formula = BorrowerAPR ~ ProsperScore + CreditScoreRangeUpper, 
##     data = prosper.na)
## m3: lm(formula = BorrowerAPR ~ ProsperScore + CreditScoreRangeUpper + 
##     AvailableBankcardCredit, data = prosper.na)
## m4: lm(formula = BorrowerAPR ~ ProsperScore + CreditScoreRangeUpper + 
##     AvailableBankcardCredit + log10(LoanOriginalAmount), data = prosper.na)
## m6: lm(formula = BorrowerAPR ~ ProsperScore + CreditScoreRangeUpper + 
##     AvailableBankcardCredit + log10(LoanOriginalAmount) + InquiriesLast6Months, 
##     data = prosper.na)
## 
## ======================================================================================
##                                m1          m2          m3          m4          m6     
## --------------------------------------------------------------------------------------
## (Intercept)                  0.360***    0.730***    0.701***    0.872***    0.858*** 
##                             (0.001)     (0.003)     (0.003)     (0.003)     (0.003)   
## ProsperScore                -0.022***   -0.018***   -0.018***   -0.017***   -0.016*** 
##                             (0.000)     (0.000)     (0.000)     (0.000)     (0.000)   
## CreditScoreRangeUpper                   -0.001***   -0.001***   -0.000***   -0.000*** 
##                                         (0.000)     (0.000)     (0.000)     (0.000)   
## AvailableBankcardCredit                             -0.000***   -0.000***   -0.000*** 
##                                                     (0.000)     (0.000)     (0.000)   
## log10(LoanOriginalAmount)                                       -0.060***   -0.059*** 
##                                                                 (0.001)     (0.001)   
## InquiriesLast6Months                                                         0.004*** 
##                                                                             (0.000)   
## --------------------------------------------------------------------------------------
## R-squared                        0.447       0.537       0.540       0.591       0.596
## adj. R-squared                   0.447       0.537       0.539       0.591       0.596
## sigma                            0.059       0.054       0.054       0.051       0.051
## F                            68477.863   49144.732   33136.421   30694.583   25035.388
## p                                0.000       0.000       0.000       0.000       0.000
## Log-likelihood              119107.488  126645.604  126904.499  131970.791  132458.413
## Deviance                       299.890     251.072     249.545     221.456     218.925
## AIC                        -238208.976 -253283.207 -253798.997 -263929.583 -264902.825
## BIC                        -238180.930 -253245.813 -253752.254 -263873.490 -264837.385
## N                            84853       84853       84853       84853       84853    
## ======================================================================================
## 
## Call:
## lm(formula = BorrowerAPR ~ ProsperScore + CreditScoreRangeUpper + 
##     AvailableBankcardCredit + log10(LoanOriginalAmount) + InquiriesLast6Months, 
##     data = prosper.na)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.196122 -0.035646 -0.005166  0.033676  0.211848 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.579e-01  3.431e-03  250.03   <2e-16 ***
## ProsperScore              -1.635e-02  8.470e-05 -193.02   <2e-16 ***
## CreditScoreRangeUpper     -4.307e-04  4.397e-06  -97.94   <2e-16 ***
## AvailableBankcardCredit   -2.034e-07  1.093e-08  -18.60   <2e-16 ***
## log10(LoanOriginalAmount) -5.877e-02  5.756e-04 -102.10   <2e-16 ***
## InquiriesLast6Months       4.138e-03  1.321e-04   31.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0508 on 84847 degrees of freedom
##   (29079 observations deleted due to missingness)
## Multiple R-squared:  0.596,  Adjusted R-squared:  0.596 
## F-statistic: 2.504e+04 on 5 and 84847 DF,  p-value: < 2.2e-16

Summary of Bivariate Analysis

Several variables contribute to the Borrower APR (price of the loan), including ProsperScore CreditScoreRange, AvailableBankcardCrdit, LoanOriginalAmount, BankcardUtilization, Delinquencies(7year), CurrentDelinquincies, and Inquiries(6mos).
* Bankcard Utiliazation correlated to Available Bankcard Credit (-0.37)
* Current Delinquencies correlated to Delinquencies(7 year) (0.384)
* These variables were not included in the bivariate analysis nor the model to avoid issues with collinear variables

The linear model has an R^2 factor of 0.60. this is a moderate result and reflects applying a linear model to non-linear relationships between the variables (NOTE - log10 relationships were explored for certain variables and adopted for LoanOriginalAmount)

Income shows a weak correlation to BorrowerAPR (-0.14). it did not improve the model (R), and is not included in the model.

It is assumed that Borrower APR reflects the risk of the loan. A borrower has to pay more if they are more likely to default.
* The BorrowerAPR is a good reflection of loan risk - the median borrowerAPR forloans in good standing is 20% and 25.4% for loans in default.
*The ProsperScore appears to be a better predictor of loan status than CreditScoreUpper with a larger difference of scores based on LStatus
+ ProsperScore 6 (good) to 5 (default) = 16% + CreditScoreUpper 719 (good) to 659 (default) = 8%

Multivariate Plots

While the difference in mean credit scores has decreased over time for loans in good v negative standing, the mean prosper score has diverged, indicating it has become a better predictor of whether a loan will be good or negative status

This set of plots investigates the % of a loan that is paid off. The first plot demonstrates loans with >1% of the total loan still due (effectively, all unpaid loans).
* the trend for higher APRs at good standing is due to the APR rate of the loan when it was issued and the length of time the loan has been open and one should not draw conclusions on this trend.
* the trend for default loans is interesting, as these loans are closed as unpaid. It is clear that borrowers pay more of the loan when then have a lower interest rate than when they have a higher interest rate! More investigation is needed to better understand this trend, but if it hold up, it has interesting implications when pricing loans. Financially it may be more beneficial to receive back more principle and less interest.

the set of charts of distributions with completed and uncompleted loans has an expected trend of higher completed loans, and loans in good standing distributed at lower APRs, and loans with negative standing distributed at higher APRs.

This graph plots the change in APR by Loan status by year. There are two notable trends in this graph.
* APRs are consistently and noticeably lower for Loans with positive status but this gap becomes very small in 2011 based on increased APRs for loans in good standing.
* In 2011, APRs for loans tihe begative status (late or default), declined very slightly from 2010, while te APR for loans in good standing increased as noted in the previous item.
The fluctuations of the APR by year also indicate the economic conditions - although it appears it takes longer for APRs to come down that it does for them to go up.

there is a very noticeable improvement in loan status, even during the economic downturn from Q4-2007 to Q2-2009 (note that the year is inital year of the loan). 2010 is the year that prosper reset after SEC intervention, and also when they introduced their ProsperScore.

These graphs of the key variables affecting Borrower APR now indicate the status of the loan (good, late, or default). The graphs support that the APR is appropriatedly assigned to price a higher risk more than a lower risk loan since good standing loans have lower rates than defauled loans. late loans are not consistent.
* Two notable observations
+ for larger loans (greater than 4.5 (log10)), loans with negative status have a lower APR than loans with good status this is not desirable
+ as inquires last 6 months increases, the APR for default loans remains exactly the same, regardless of the number of inquires where APR increased relative to number of inquires for good standing and for late loans.

Summary of Multivariate plots

  • Added further depth to the bivariate analysis by including loan status information and loan payment information, as well as investigating changes over time.
  • The prosper score is a better indicator of loan status than credit score, and the delta has increased over time.
  • Plotting variables that affect APR with loan status indicate there are some inconsistencies in the assumption that higher APR results in higher default at higher loan values and at higher credit inquiries.
  • Borrowers with a higher APR that are in negative status tend to pay off a lower portion of their loan. However, APR is affected by year (economic and competitive factors) in addition to the credit worthiness of the borrower so there are other factors at play.
  • The number and % of loans in good standing has increased noticeably based on the year the loan was initiated. Once again, this is a factor of economic factors and also better ability to grant and price loans accordingly.

Final Plots and Analysis

Graph 1 - ProsperScore v. Loan Status

  • The ProsperScore is the Prosper internal designation of the credit-worthiness of a loan, and it is the variable that correlates highest to the APR that is assigned to the loan (-0.668). Compared to the externally available CreditScoreUpper(FICO score) variable (correlates -0.431), the ProsperScore is a better score.
  • This plot demonstrates that loans in good standing have a 20% higher ProsperScore than late and default loans.
  • The conclusion from this plot is that the ProsperScore is a good indicator of the borrower being in good standing (Good standing = Loan is current, or completed)
  • The business implication is that the more accurate a borrowers creditworthiness can be determined, the better the APR can be optimized. This is a competitive advantage in the marketplace for loans.

Graph 2 - Variables affecting Borrower APR

  1. This set of plots includes the variables with the highest correlation to APR for prosper loans, and also plots the status of loans
  • The Avaliable Bankcard Credit (correlation -0.347) and Original loan amount (correlation (log10) -0.316) both drive a lower APR as their values increase
  • Delinquencies (correlation 0.157) and Credit Inquiries (correlation 0.14) both drive a higher APR as their values increase
  1. The linear model built with these variables (+prosperscore and creditscoreupper) has an R^2 value of 0.60, which indicates a modest fit.
  • All of these graphs are non-linear, particularly at the extremes of the variable
  • APR peaks on the Original Loan Amount plot at $3k and $7.5k. A log scale is shown here because it correlates better, and peaks are noticable at 10^3.5 and 10^3.8. All loan status levels follow this trend.
  • for larger loans (greater than ~10^4.5 = $30k), loans with negative status have a lower APR than loans with good status this is not desirable.
  1. The graphs support that the APR is appropriatedly assigned to price a higher risk more than a lower risk loan since good standing loans have lower rates than defauled loans. late loans are not consistent.
    Conclusion - Variables that negatively affect APR also trend to higher negative loan status, and thhese charts indicate that the extreme values of the variables is where the linearity is the poorest fit.

Reflections

I’m astounded and stoked about the level of information in this project.
The number of variables created a huge challenge. It took considerable effort to determine the focus of this project - specifically the borrowers creditworthiness, the APR, and the relationship between APR and Loan Status.
I spent several hours trying to correlate Loan Status with various factors, but the analysis was really uninteresting. Once I determined that the APR, or price of the loan, was the more important factor, and then consolidated the LoanStatus factors into Good, Late, Default, the project went very smoothly.

The data set is so rich, and a team of analyst are certainly busy at Prosper. This analysis focuses on the price of loans (APR) and their status compared to loan parameters and the borrowers creditworthiness.
A few items were surprising -

  • There is almost no correlation between the borrowers income and the borrowers APR
  • The correlation between Loan amount and APR is less linear than expected, with a few peaks, and declining rates at higher loan amounts
  • Prosper has a better algorithm than the credit companies for determining credit scores as it relates to assigning APRs that are an appropriate reflection of risk
  • Prosper’s ability to gather data on loans as they are paid on a real time basis and modify their models is a big advantage over the credit reporting companies

Points for further investigation

  • A few variables were investigated over time to explain charts, but given the external variables that affect APR, an entire additional project could be completed on correlation over time, or at least pre 2009 and post 2009. Tying in external data would be particularly interesting - what portion of Prosper’s improved results is due to an improved economy, and what is due to improved modeling?
  • Financials - investigate what segment of loans is the most profitable / least profitable. Determine if some of the prosper generated variables correlate to the profitability of the loans
  • Marketing - there are very few referrals in the database, but as a prosper lender, I am aware they have regular programs to encourage lending and borrowing. Exploring these relationships would be an interesting project as well.
  • compare correlation pre and post 2009
  • compare financials of higher and lower APR given default rates